{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ur8xi4C7S06n"
      },
      "outputs": [],
      "source": [
        "# Copyright 2020 Google LLC\n",
        "#\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "#     https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eHLV0D7Y5jtU"
      },
      "source": [
        "<table align=\"left\">\n",
        "  <td>\n",
        "    <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/ai-platform-samples/blob/master/notebooks/samples/aihub/house_prices/training_an_xgboost_model_with_ai_hub.ipynb\">\n",
        "      <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Run in Colab\n",
        "    </a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a href=\"https://github.com/GoogleCloudPlatform/ai-platform-samples/blob/master/notebooks/samples/aihub/house_prices/training_an_xgboost_model_with_ai_hub.ipynb\">\n",
        "      <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n",
        "      View on GitHub\n",
        "    </a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "23a306a87fae"
      },
      "source": [
        "# Training an XGBoost model with AI Hub"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tvgnzT1CKxrO"
      },
      "source": [
        "### Overview\n",
        "\n",
        "AI Hub is a repository of plug-and-play AI components including end-to-end AI pipelines and out-of-the-box algorithms. The following is an example of using the [XGBoost AI Hub container](https://aihub.cloud.google.com/u/0/p/products%2F0ccd8a63-71a7-4e48-a68b-685692a62e92) to train a model with the AI Platform Training service and creating a model endpoint using AI Platform Prediction.\n",
        "\n",
        "AI Hub includes components that make it easy to run training jobs at scale on Google's cloud infrastructure. Without revising any code, users can run distributed training jobs on a variety of hardware (including GPU and TPU devices). These components offer native support for AI Platform Training and export trained model files that can be uploaded to AI Platform Prediction for generating inferences. The components also include a run report that provides practical insights into the behavior of the trained model, and a visual inspection of the training and validation error for each run.\n",
        "\n",
        "### Dataset\n",
        "\n",
        "The dataset used in this notebook includes residential real-estate data for homes in Ames, Iowa. The data are stored in a tabular format (.CSV) and include the sale price of 1,460 homes along with 79 explanatory features.\n",
        "\n",
        "The dataset comes from Kaggle's [competition to predict House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques). The Kaggle API can be used to download and import the data.\n",
        "\n",
        "### Objective\n",
        "\n",
        "The following notebook provides an example workflow of using an AI Hub component to train an XGBoost regression model and create an endpoint for generating predictions.\n",
        "\n",
        "The notebook includes a complete ML workflow from data ingestion to model training and deployment. The steps below can be used as a template for creating end-to-end workflows with XGBoost and tabular data.\n",
        "\n",
        "### Costs \n",
        "\n",
        "This tutorial uses billable components of Google Cloud Platform (GCP):\n",
        "\n",
        "* Cloud AI Platform\n",
        "* Cloud Storage\n",
        "\n",
        "Learn about [Cloud AI Platform\n",
        "pricing](https://cloud.google.com/ml-engine/docs/pricing) and [Cloud Storage\n",
        "pricing](https://cloud.google.com/storage/pricing), and use the [Pricing\n",
        "Calculator](https://cloud.google.com/products/calculator/)\n",
        "to generate a cost estimate based on your projected usage."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BF1j6f9HApxa"
      },
      "source": [
        "### Set up your GCP project\n",
        "\n",
        "**The following steps are required, regardless of your notebook environment.**\n",
        "\n",
        "1. [Select or create a GCP project.](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n",
        "\n",
        "2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)\n",
        "\n",
        "3. [Enable the AI Platform APIs and Compute Engine APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)\n",
        "\n",
        "4. Enter your project ID in the cell below. Then run the  cell to make sure the\n",
        "Cloud SDK uses the right project for all the commands in this notebook.\n",
        "\n",
        "**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "oM1iC_MfAts1"
      },
      "outputs": [],
      "source": [
        "from __future__ import absolute_import\n",
        "from __future__ import division\n",
        "from __future__ import print_function\n",
        "\n",
        "PROJECT_ID = \"[your-project-id]\"\n",
        "! gcloud config set project $PROJECT_ID"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dr--iN2kAylZ"
      },
      "source": [
        "### Authenticate your GCP account\n",
        "\n",
        "**If you are using AI Platform Notebooks**, your environment is already\n",
        "authenticated. Skip this step."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zgPO1eR3CYjk"
      },
      "source": [
        "### Create a Cloud Storage bucket\n",
        "\n",
        "**The following steps are required, regardless of your notebook environment.**\n",
        "\n",
        "When you submit a training job using the Cloud SDK, you upload a Python package\n",
        "containing your training code to a Cloud Storage bucket. AI Platform runs\n",
        "the code from this package. In this tutorial, AI Platform also saves the\n",
        "trained model that results from your job in the same bucket. You can then\n",
        "create an AI Platform model version based on this output in order to serve\n",
        "online predictions.\n",
        "\n",
        "Set the name of your Cloud Storage bucket below. It must be unique across all\n",
        "Cloud Storage buckets. \n",
        "\n",
        "You may also change the `REGION` variable, which is used for operations\n",
        "throughout the rest of this notebook. Make sure to [choose a region where Cloud\n",
        "AI Platform services are\n",
        "available](https://cloud.google.com/ml-engine/docs/tensorflow/regions). You may\n",
        "not use a Multi-Regional Storage bucket for training with AI Platform."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "both",
        "id": "MzGDU7TWdts_"
      },
      "outputs": [],
      "source": [
        "BUCKET_NAME = \"[your-bucket-name]\"\n",
        "REGION = \"us-central1\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-EcIXiGsCePi"
      },
      "source": [
        "**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NIq7R4HZCfIc"
      },
      "outputs": [],
      "source": [
        "! gsutil mb -l $REGION gs://$BUCKET_NAME"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ucvCsknMCims"
      },
      "source": [
        "Finally, validate access to your Cloud Storage bucket by examining its contents:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "vhOb7YnwClBb"
      },
      "outputs": [],
      "source": [
        "! gsutil ls -al gs://$BUCKET_NAME"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "79e8bfbbac0c"
      },
      "source": [
        "### Install Kaggle API\n",
        "\n",
        "Use PIP to install the Kaggle API for downloading the House Prices dataset. Follow the [instructions on GitHub](https://github.com/Kaggle/kaggle-api) for generating an API token for Kaggle, then set the `KAGGLE_USERNAME` and `KAGGLE_KEY` ENV variables accordingly. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "43b769799f30"
      },
      "outputs": [],
      "source": [
        "! pip install --user kaggle"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "bc8d44b8b991"
      },
      "outputs": [],
      "source": [
        "%env KAGGLE_USERNAME YOUR-KAGGLE-USERNAME\n",
        "%env KAGGLE_KEY YOUR-KAGGLE-KEY"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XoEqT2Y4DJmf"
      },
      "source": [
        "### Import libraries and download data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "pRUOFELefqf1"
      },
      "outputs": [],
      "source": [
        "import zipfile\n",
        "import pandas as pd\n",
        "import tensorflow as tf\n",
        "import os\n",
        "\n",
        "from IPython.core.display import HTML\n",
        "import googleapiclient.discovery"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "f6c78e1cfd36"
      },
      "outputs": [],
      "source": [
        "# Import data from Kaggle\n",
        "# For documentation on using the Kaggle API for Python refer to the official repo: https://github.com/Kaggle/kaggle-api\n",
        "!~/.local/bin/kaggle competitions download -c house-prices-advanced-regression-techniques"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "d9b4d1a31d23"
      },
      "outputs": [],
      "source": [
        "# If you don't have a Kaggle account:\n",
        "! gsutil cp gs://cloud-samples-data/ai-hub/house-prices-advanced-regression-techniques.zip ."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "fe8f505ec1f2"
      },
      "outputs": [],
      "source": [
        "# Unzip the training and test datasets\n",
        "with zipfile.ZipFile('house-prices-advanced-regression-techniques.zip', 'r') as data_zip:\n",
        "    data_zip.extractall('data')\n",
        "# Remove the downloaded compressed file\n",
        "tf.io.gfile.remove('house-prices-advanced-regression-techniques.zip')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6de81d344017"
      },
      "source": [
        "### Preprocess the data\n",
        "\n",
        "Import and preprocess the train and test datasets before training the model. The training and test sets each include 1460 examples. Below I've partitioned 10% of the training data as a validation set."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3bd5497063e7"
      },
      "outputs": [],
      "source": [
        "# Import training data\n",
        "train_data = pd.read_csv('data/train.csv').sample(frac=1)\n",
        "train_data['set'] = 'train'"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0ba4c7e0d2d9"
      },
      "outputs": [],
      "source": [
        "# Partition 10% of training data as a validation set\n",
        "train_data.iloc[0:(int(train_data.shape[0] * 0.1)), train_data.columns.get_loc(\"set\")] = 'validation'"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "19a616d1a159"
      },
      "outputs": [],
      "source": [
        "# Import test data\n",
        "test_data = pd.read_csv('data/test.csv')\n",
        "test_data['SalePrice'] = None\n",
        "test_data['set'] = 'test'\n",
        "# Pull Ids for test dataset for writing submission.csv file\n",
        "test_ids = test_data['Id']"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "114b3b78eb07"
      },
      "outputs": [],
      "source": [
        "# Combine training/validation/test sets into single DataFrame\n",
        "all_data = train_data.append(test_data)\n",
        "all_data = all_data.drop(labels='Id', axis=1)\n",
        "# Reorder columns\n",
        "cols = all_data.columns.tolist()\n",
        "del cols[-2:]\n",
        "cols.insert(0, 'SalePrice')\n",
        "cols.insert(0, 'set')\n",
        "all_data = all_data[cols]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "06a044e483de"
      },
      "source": [
        "The data includes a large number of categorical features. For the sake of simplicity, assume that all integer features are ordinal and perform one-hot encoding for each of the string features."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "26a90b1942e2"
      },
      "outputs": [],
      "source": [
        "def one_hot_encode_features(features):\n",
        "    preprocessed_features = pd.DataFrame()\n",
        "\n",
        "    # One-hot encode categorical features\n",
        "    for col_name in features.columns:\n",
        "        # Assume that all numeric columns are continuous or ordinal\n",
        "        if col_name in ['set', 'SalePrice'] or features[col_name].dtype in ['int64', 'float64']:\n",
        "            preprocessed_features = pd.concat((preprocessed_features, features[col_name]), axis=1)\n",
        "        else:\n",
        "            preprocessed_features = pd.concat((preprocessed_features, pd.get_dummies(features[col_name])), axis=1)\n",
        "\n",
        "    return preprocessed_features\n",
        "\n",
        "\n",
        "all_data = one_hot_encode_features(all_data)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "4590cda9ee53"
      },
      "outputs": [],
      "source": [
        "# Revise column names\n",
        "col_names = ['set', 'SalePrice']\n",
        "col_names.extend(['feature_{}'.format(i) for i in range(all_data.shape[1] - 2)])\n",
        "all_data.columns = col_names"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "b6c891f95431"
      },
      "source": [
        "Replace missing values with the mean of each column from the training data. Then standardize the features by subtracting the column mean and dividing by the column standard deviation."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "bc46324fddc9"
      },
      "outputs": [],
      "source": [
        "# Split data into train/validation/test sets\n",
        "train = all_data.loc[all_data['set'] == 'train']\n",
        "validation = all_data.loc[all_data['set'] == 'validation']\n",
        "test = all_data.loc[all_data['set'] == 'test']\n",
        "# Remove 'set' column\n",
        "train = train.drop('set', axis=1)\n",
        "validation = validation.drop('set', axis=1)\n",
        "test = test.drop('set', axis=1)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "9161d9457c87"
      },
      "outputs": [],
      "source": [
        "# Pull column-wise mean and standard deviation from training set\n",
        "train_column_means = train.mean(axis=0)\n",
        "train_column_sd = train.std(axis=0)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ddb14ade9141"
      },
      "outputs": [],
      "source": [
        "# Impute missing values with column mean\n",
        "train.iloc[:, 1:] = train.iloc[:, 1:].fillna(train_column_means[1:])\n",
        "validation.iloc[:, 1:] = validation.iloc[:, 1:].fillna(train_column_means[1:])\n",
        "test.iloc[:, 1:] = test.iloc[:, 1:].fillna(train_column_means[1:])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "d97689032219"
      },
      "outputs": [],
      "source": [
        "# Standardize features for the train, validation and test sets\n",
        "def standardize_features(features, col_means, col_sds):\n",
        "    for i in range(features.shape[1]):\n",
        "        if col_sds[i] != 0:\n",
        "            features.iloc[:, i] = features.iloc[:, i].subtract(col_means[i]).divide(col_sds[i])\n",
        "    return features"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "50450605b5ef"
      },
      "outputs": [],
      "source": [
        "train.iloc[:, 1:] = standardize_features(\n",
        "    features=train.iloc[:, 1:],\n",
        "    col_means=train_column_means[1:],\n",
        "    col_sds=train_column_sd[1:])\n",
        "validation.iloc[:, 1:] = standardize_features(\n",
        "    features=validation.iloc[:, 1:],\n",
        "    col_means=train_column_means[1:],\n",
        "    col_sds=train_column_sd[1:])\n",
        "test.iloc[:, 1:] = standardize_features(\n",
        "    features=test.iloc[:, 1:],\n",
        "    col_means=train_column_means[1:],\n",
        "    col_sds=train_column_sd[1:])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6167909c8ede"
      },
      "source": [
        "### Write data to CSV files and upload to Google Cloud Storage\n",
        "\n",
        "Use [TensorFlow's Gfile class](https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile) to copy the preprocessed CSV files to a GCS bucket."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "44ff6f882d79"
      },
      "outputs": [],
      "source": [
        "# Save preprocessed data as CSV files\n",
        "os.mkdir('data/preprocessed')\n",
        "train.to_csv('data/preprocessed/train.csv', index=False)\n",
        "validation.to_csv('data/preprocessed/validation.csv', index=False)\n",
        "test.to_csv('data/preprocessed/test.csv', index=False)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "1201564959d0"
      },
      "outputs": [],
      "source": [
        "# Copy the preprocessed CSV data to a GCS bucket\n",
        "for dataset in tf.io.gfile.glob('data/preprocessed/*.csv'):\n",
        "    tf.io.gfile.copy(\n",
        "        dataset,\n",
        "        os.path.join('gs://', BUCKET_NAME, 'house_prices_data', os.path.basename(dataset)),\n",
        "        overwrite=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9ba4f1821851"
      },
      "source": [
        "### Submit a training job with the XGBoost AI Hub component\n",
        "\n",
        "To use an AI Hub component with the AI Platform Training service, navigate to the component's page and click the 'Edit Training Command' button. A pop-up will appear with a list of arguments that the component accepts and the Bash shell command for submitting a training job to AI Platform Training. \n",
        "\n",
        "The [XGBoost AI Hub component](https://aihub.cloud.google.com/u/0/p/products%2F0ccd8a63-71a7-4e48-a68b-685692a62e92) is a Docker image hosted on [Google Container Registry](https://console.cloud.google.com/gcr/images/aihub-c2t-containers/GLOBAL/kfp-components/trainer/dist_xgboost?gcrImageListsize=30). The component uses [AI Platform Training's custom container feature](https://cloud.google.com/ml-engine/docs/containers-overview) to run training jobs.\n",
        "\n",
        "The parameter values below can be revised for your use case and data. For additional information on submitting a training job on AI Platform Training refer to the [documentation](https://cloud.google.com/ml-engine/docs/training-jobs). To view the status of a training run and inspect the logs, navigate to the [GCP console](https://console.cloud.google.com) and go to the AI Platform > Jobs page."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "a2a9aa307685"
      },
      "outputs": [],
      "source": [
        "# Set parameter values for a training run\n",
        "TRAINING_DATA = os.path.join('gs://', BUCKET_NAME, 'house_prices_data/train*')\n",
        "VALIDATION_DATA = os.path.join('gs://', BUCKET_NAME, 'house_prices_data/val*')\n",
        "OUTPUT_LOCATION = os.path.join('gs://', BUCKET_NAME, 'xgboost_output')\n",
        "TARGET_COLUMN = 'SalePrice'\n",
        "DATA_TYPE = 'csv'\n",
        "FRESH_START = True\n",
        "WEIGHT_COLUMN = \"\"\n",
        "NUMBER_OF_CLASSES = 1\n",
        "NUM_ROUND = 250\n",
        "EARLY_STOPPING_ROUNDS = -1\n",
        "VERBOSITY = 1\n",
        "ETA = 0.1\n",
        "GAMMA = 0.001\n",
        "MAX_DEPTH = 10\n",
        "MIN_CHILD_WEIGHT = 1\n",
        "MAX_DELTA_STEP = 0\n",
        "SUBSAMPLE = 1\n",
        "COLSAMPLE_BYTREE = 1\n",
        "COLSAMPLE_BYLEVEL = 1\n",
        "COLSAMPLE_BYNODE = 1\n",
        "REG_LAMBDA = 1\n",
        "ALPHA = 0\n",
        "SCALE_POS_WEIGHT = 1\n",
        "OBJECTIVE = 'reg:gamma'\n",
        "TREE_METHOD = 'auto'\n",
        "\n",
        "# AI Platform Training job related arguments:\n",
        "SCALE_TIER = 'CUSTOM'\n",
        "MASTER_MACHINE_TYPE = 'standard_gpu'"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "f64834106ae6"
      },
      "outputs": [],
      "source": [
        "import uuid\n",
        "\n",
        "JOB_NAME = \"kaggle_xgboost_example_\" + uuid.uuid4().hex[:10]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "f09af2a592ca"
      },
      "outputs": [],
      "source": [
        "# Submit AI Platform training job.\n",
        "\n",
        "!gcloud ai-platform jobs submit training {JOB_NAME} \\\n",
        "    --master-image-uri gcr.io/aihub-c2t-containers/kfp-components/trainer/dist_xgboost@sha256:7de885ef326e55b663ff0eb06724d580116953fe6a702383a113b2f306f308ae \\\n",
        "    --region {REGION} \\\n",
        "    --scale-tier {SCALE_TIER} \\\n",
        "    --master-machine-type {MASTER_MACHINE_TYPE} \\\n",
        "    --stream-logs \\\n",
        "    -- \\\n",
        "    --training-data {TRAINING_DATA} \\\n",
        "    --target-column {TARGET_COLUMN} \\\n",
        "    --validation-data {VALIDATION_DATA} \\\n",
        "    --output-location {OUTPUT_LOCATION} \\\n",
        "    --data-type {DATA_TYPE} \\\n",
        "    --fresh-start {FRESH_START} \\\n",
        "    --weight-column {WEIGHT_COLUMN} \\\n",
        "    --number-of-classes {NUMBER_OF_CLASSES} \\\n",
        "    --num-round {NUM_ROUND} \\\n",
        "    --early-stopping-rounds {EARLY_STOPPING_ROUNDS} \\\n",
        "    --verbosity {VERBOSITY} \\\n",
        "    --eta {ETA} \\\n",
        "    --gamma {GAMMA} \\\n",
        "    --max-depth {MAX_DEPTH} \\\n",
        "    --min-child-weight {MIN_CHILD_WEIGHT} \\\n",
        "    --max-delta-step {MAX_DELTA_STEP} \\\n",
        "    --subsample {SUBSAMPLE} \\\n",
        "    --colsample-bytree {COLSAMPLE_BYTREE} \\\n",
        "    --colsample-bylevel {COLSAMPLE_BYLEVEL} \\\n",
        "    --colsample-bynode {COLSAMPLE_BYNODE} \\\n",
        "    --reg-lambda {REG_LAMBDA} \\\n",
        "    --alpha {ALPHA} \\\n",
        "    --scale-pos-weight {SCALE_POS_WEIGHT} \\\n",
        "    --objective {OBJECTIVE} \\\n",
        "    --tree-method {TREE_METHOD} \\\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6b2457c6fb36"
      },
      "source": [
        "### Deploy the trained model to AI Platform Prediction\n",
        "\n",
        "After the training run succeeds a model file (`model.bst`) will be exported to the GCS bucket defined by the `OUTPUT_LOCATION` parameter. Create a model resource on AI Platform Prediction and deploy a new version using the trained XGBoost model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6d4aa3aaebe1"
      },
      "outputs": [],
      "source": [
        "MODEL_NAME = 'xgboost_housing_price_predictor'\n",
        "MODEL_VERSION = 'v1'"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "aa396b13f031"
      },
      "outputs": [],
      "source": [
        "# Delete model version resource\n",
        "! gcloud ai-platform versions delete {MODEL_VERSION} --quiet --model {MODEL_NAME} \n",
        "\n",
        "# Delete model resource\n",
        "! gcloud ai-platform models delete {MODEL_NAME} --quiet"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "433a9a8eedb8"
      },
      "outputs": [],
      "source": [
        "# Create a model resource on AI Platform Prediction. Once this is created, multiple versions\n",
        "# of a model can be uploaded to this resource.\n",
        "!gcloud ai-platform models create {MODEL_NAME} --regions {REGION}"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "f2c1d4ffdab0"
      },
      "outputs": [],
      "source": [
        "# Create a model version using the exported XGBoost model from the training run\n",
        "!gcloud ai-platform versions create {MODEL_VERSION} \\\n",
        "  --model {MODEL_NAME} \\\n",
        "  --origin {OUTPUT_LOCATION} \\\n",
        "  --runtime-version=1.14 \\\n",
        "  --python-version=3.5 \\\n",
        "  --framework XGBOOST "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "a95b2179299e"
      },
      "outputs": [],
      "source": [
        "# Verify that the model endpoint was created successfully\n",
        "!gcloud ai-platform versions describe {MODEL_VERSION} \\\n",
        "  --model {MODEL_NAME}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "381120b87c9f"
      },
      "source": [
        "### Generate inferences from the model endpoint\n",
        "\n",
        "Once the model is deployed to AI Platform Prediction, the endpoint can be used to serve inferences. Refer to the [documentation](https://cloud.google.com/ml-engine/docs/online-predict) for additional information on generating online predictions from an AI Platform endpoint."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "55ca669c1d68"
      },
      "outputs": [],
      "source": [
        "service = googleapiclient.discovery.build('ml', 'v1')\n",
        "name = 'projects/{}/models/{}/versions/{}'.format(PROJECT_ID, MODEL_NAME, MODEL_VERSION)\n",
        "\n",
        "response = service.projects().predict(\n",
        "    name=name,\n",
        "    # Generate inferences for the first 10 examples from the test set\n",
        "    body={'instances': test.iloc[0:10, :].values.tolist()}\n",
        ").execute()\n",
        "\n",
        "if 'error' in response:\n",
        "    print(response['error'])\n",
        "else:\n",
        "    online_results = response['predictions']\n",
        "    print(online_results)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "82d7882862dd"
      },
      "source": [
        "### Inspect the training job\n",
        "\n",
        "After the training job completes on AI Platform Training a run report will be created in the `OUTPUT_LOCATION`. The report examines the quality of the trained model and provides a visual inspection of the training and validation error from the training run."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "f2cca1ea1069"
      },
      "outputs": [],
      "source": [
        "tf.io.gfile.copy(\n",
        "    os.path.join(OUTPUT_LOCATION, 'report.html'),\n",
        "    'report.html',\n",
        "    overwrite=True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "fb634831beb0"
      },
      "outputs": [],
      "source": [
        "with open('report.html', 'r') as f:\n",
        "    html_report = f.read()\n",
        "\n",
        "display(HTML(html_report))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TpV-iwP9qw9c"
      },
      "source": [
        "### Cleaning up\n",
        "\n",
        "To clean up all GCP resources used in this project, you can [delete the GCP\n",
        "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "sx_vKniMq9ZX"
      },
      "outputs": [],
      "source": [
        "# Delete model version resource\n",
        "! gcloud ai-platform versions delete {MODEL_VERSION} --quiet --model {MODEL_NAME} \n",
        "\n",
        "# Delete model resource\n",
        "! gcloud ai-platform models delete {MODEL_NAME} --quiet\n",
        "\n",
        "# If training job is still running, cancel it\n",
        "! gcloud ai-platform jobs cancel {JOB_NAME} --quiet"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [],
      "name": "training_an_xgboost_model_with_ai_hub.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
